Self-supervised Learning

Design a proxy task that uses unlabeled or weakly-labeled data to help the original task. Essentially, self-supervised learning is multi-task learning in which the proxy task does not rely on heavy human annotation. The key question is which annotation-free proxy task is the most effective.
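
As a concrete illustration, here is a minimal PyTorch sketch of the rotation-prediction proxy task that appears in the list below. The function names, model choice, and hyperparameters are assumptions for illustration, not taken from any cited paper; the point is only that the proxy labels are generated from the unlabeled images themselves.

```python
# Minimal sketch of a rotation-prediction proxy task (assumed setup, PyTorch):
# the proxy label is the rotation applied to the image, so it is free to generate.
import torch
import torch.nn.functional as F
import torchvision

def make_rotation_batch(images):
    """images: (N, C, H, W). Rotate each image by k*90 degrees; k is the proxy label."""
    ks = torch.randint(0, 4, (images.size(0),), device=images.device)
    rotated = torch.stack([torch.rot90(img, int(k), dims=(1, 2))
                           for img, k in zip(images, ks)])
    return rotated, ks

# Shared backbone with a 4-way proxy head; the backbone is what gets transferred
# to the original (downstream) task afterwards.
backbone = torchvision.models.resnet18(num_classes=4)
optimizer = torch.optim.SGD(backbone.parameters(), lr=0.1)

def proxy_step(images):
    rotated, proxy_labels = make_rotation_batch(images)
    loss = F.cross_entropy(backbone(rotated), proxy_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```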

Please refer to the tutorial slides [1] [2], the survey, and the paper list.

  1. image-to-image

    • image-to-image translation: colorization [1], inpainting [2], cross-channel generation [3]
    • spatial location: relative location [1], jigsaw [2], predicting rotation [3]
    • contrastive learning: instance-wise contrastive learning (e.g., MoCo), prototypical contrastive learning (clustering) [1] [2]
    • MAE: Siamese MAE
  2. video-to-image

    • temporal coherence: [1] [2] [3]
    • temporal order: [1] [2] [3]
    • unsupervised image tasks with video clues: clustering [1], optical flow prediction [1], unsupervised segmentation based on optical flow [1] [2], unsupervised depth estimation based on optical flow [2]
    • video generation [1]
    • cross-modal consistency: consistency between visual kernel and optical flow kernel [1]
  3. video-to-video: all video-to-image methods can be used for video-to-video by averaging frame features (see the sketch after this list).

    • 3D rotation [1]
    • Cubic puzzle [1]
    • video localization and classification [1]
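
The frame-averaging idea in the video-to-video item above can be sketched as follows; the encoder, feature dimension, and clip shape are illustrative assumptions.

```python
# Minimal sketch (assumed shapes): lift an image-level encoder to video level
# by averaging per-frame features, as described for video-to-video above.
import torch
import torchvision

encoder = torchvision.models.resnet18(num_classes=128)   # any image feature extractor

def clip_feature(video):
    """video: (T, C, H, W) frames -> one (128,) clip-level feature."""
    with torch.no_grad():
        frame_feats = encoder(video)      # frames treated as a batch -> (T, 128)
    return frame_feats.mean(dim=0)        # average over time

clip = torch.randn(16, 3, 224, 224)       # a dummy 16-frame clip
feat = clip_feature(clip)                 # shape (128,)
```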

Multi-task self-supervised learning: integrate multiple proxy tasks [1] [2]
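
A minimal sketch of the integration, under the assumption that each proxy task exposes its own loss function; the interface and weighting scheme are illustrative choices, not the ones used in [1] [2].

```python
# Minimal sketch (assumed interface): multi-task self-supervised learning as a
# weighted sum of several proxy losses computed with a shared encoder.
import torch

def multi_proxy_loss(batch, encoder, proxy_tasks, weights):
    """proxy_tasks: dict name -> callable(encoder, batch) returning a scalar loss.
    weights: dict name -> float. Both are illustrative, not an API from the papers."""
    total = torch.zeros((), device=batch.device)
    for name, task_loss in proxy_tasks.items():
        total = total + weights[name] * task_loss(encoder, batch)
    return total

# Usage: combine e.g. the rotation loss sketched earlier with a contrastive loss.
# loss = multi_proxy_loss(images, backbone,
#                         {"rotation": rotation_loss, "contrastive": nce_loss},
#                         {"rotation": 1.0, "contrastive": 0.5})
```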

Combined with other frameworks: self-supervised GAN [1]

A recent paper [1*] claims that the best self-supervised learning method is still the earliest image inpainting model, and that the design of the network architecture has a significant impact on the performance of self-supervised learning methods.

SimCLR [2*] is a state-of-the-art self-supervised learning method whose performance approaches that of supervised learning.
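
As a reference point for what SimCLR optimizes, here is a minimal sketch of its NT-Xent contrastive loss. The function name, temperature, and tensor shapes are assumptions; the full method in [2*] also involves a strong augmentation pipeline, a projection head, and large batches.

```python
# Minimal sketch (PyTorch) of the NT-Xent loss used in SimCLR: two augmented
# views of the same image are positives; all other views in the batch are negatives.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, D) projections of two augmented views of the same N images."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit length
    sim = z @ z.t() / temperature                        # (2N, 2N) scaled cosine similarity
    sim.fill_diagonal_(float("-inf"))                    # a view is never its own negative
    # The positive of view i is its sibling view: i+N for the first half, i-N for the second.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

In [2*], z1 and z2 come from a shared encoder followed by a small MLP projection head applied to two random augmentations of each image.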

References

[1*] Alexander Kolesnikov, Xiaohua Zhai, Lucas Beyer. Revisiting Self-Supervised Visual Representation Learning. CVPR 2019.

[2*] Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709, 2020.